Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 22 16:26:48 2021"
“How are you feeling right now? What do you expect to learn? Where did you hear about the course?”
I am extremely excited about this course! I have just started my PhD studies at the Institute of Atmospheric and Earth System Research. My research topic is related to building a framework for climate competencies, based mostly on interviews and surveys. Since my background is in natural sciences, I have little to no experience in behavioral sciences and their methodology. For this reason, I have chosen this course. It was suggested to me by our colleagues in the Faculty of Educational Sciences.
I expect to learn to use R, which I’m not too worried about since I have experience in Python, and even more importantly, the statistical practices and rules that apply in human sciences, when it’s not all about nature and numbers.
As text: https://github.com/joulasip/IODS-project.git
date()
## [1] "Mon Nov 22 16:26:48 2021"
The data include survey results regarding learning from statistics course students in Finland. The data were collected in 2014-2015. After combining variables that measure the same dimension and excluding observations where the exam points are zero, the data consist of 166 observations and 7 variables, listed here:
The three learning approaches combine 8 subscales, each of which includes 4 questions:
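The wrangling itself is done in a separate script, but the combination boils down to taking row means over the question columns of each dimension. A minimal sketch with hypothetical column names (the real survey items are named differently):

```r
# Combine a dimension as the mean of its question columns.
# The column names q1..q4 are placeholders, not the actual item names.
answers <- data.frame(q1 = c(4, 3), q2 = c(5, 2), q3 = c(3, 4), q4 = c(4, 3))
deep <- rowMeans(answers[, c("q1", "q2", "q3", "q4")])
deep  # averaged scale scores for two respondents: 4 and 3
```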
Reading the data and checking the structure (the working directory being IODS-project):
data <- read.csv("data/learning2014", row.names = 1)
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Summary of the data — basic statistics of the variables:
summary(data)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Graphical overview using the libraries GGally and ggplot2. The first variable, gender, defines the colour, and alpha sets the transparency in the graphs so that overlapping data can be seen:
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The correlation coefficient, corr, refers to the Pearson correlation by default: correlations equal to +1 or −1 correspond to data points lying exactly on a line. The stars after a correlation value indicate its significance level (p-value).
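As a quick sanity check on that statement, perfectly collinear data give a Pearson correlation of (numerically) 1 with R’s `cor()`, the same statistic the pairs plot reports:

```r
x <- 1:10
y <- 2 * x + 3        # perfectly linear relationship
cor(x, y)             # Pearson correlation, numerically 1
```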
There are almost twice as many female students as male students. The majority of the students are between 20 and 30 years old. Male students generally report a better attitude than female students.
Points and attitude have the highest positive correlation, 0.437, which is also significant (p < 0.001). The highest significant negative correlation is between the surface and deep learning approaches, corr = -0.324 (up to -0.622 for males). This is to be expected, since they could be considered opposites.
The strategic learning approach is slightly more common among female students than among male students.
A less significant negative correlation can be seen between the surface and strategic learning approaches, corr = -0.161, and between the surface learning approach and attitude, corr = -0.176.
The distribution of the deep learning approach is shifted more towards higher values than those of the strategic and surface approaches: its mean and median are approx. 3.6, versus approx. 3.1 for the strategic approach and 2.8 for the surface approach.
To me it seems odd that the points do not correlate with the deep learning approach at all. Could this mean that the exam/assignments do not measure deep learning? Or that the deep learning approach does not in fact lead to learning? There is some evidence (although no significance is shown) that the surface learning approach correlates negatively and the strategic approach positively with the points.
Since the points have a highly significant correlation with attitude and slightly significant correlations with the surface and strategic learning approaches, I have chosen those three as explanatory variables in the multiple regression on the target variable “points”.
my_model_multi <- lm(points ~ attitude + stra + surf, data = data)
summary(my_model_multi)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The following formula describes the multiple linear model:
\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\)
where \(\beta_0\) is the intercept and \(\beta_1\), \(\beta_2\) and \(\beta_3\) are the regression coefficients (Estimate) for attitude, the strategic approach and the surface approach, respectively. These measure the change in the mean response associated with a unit change in the corresponding explanatory variable. \(\varepsilon_i\) refers to the error terms, which are assumed to follow a normal distribution with zero mean and the same variance \(\sigma^2\) for all values of the explanatory variables (as explained in the course material book Multivariate Analysis for the Behavioural Sciences).
The residuals (differences between the observed values of the target variable and the fitted values) have a median of approx. 0.52. The F-statistic tests the omnibus null hypothesis that all the regression coefficients are zero; it is calculated by dividing the regression mean square by the residual mean square (RMS), where the RMS gives an estimate of the variance \(\sigma^2\). Since the F-statistic here has a very low associated p-value (3.156e-08), it is very unlikely that all of the coefficients are zero, as can be expected.
The t values are obtained by dividing each estimated regression coefficient by its standard error, and the associated significance levels (Pr(>|t|) in the table) indicate the importance of the explanatory variable.
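This can be verified by hand from the first model’s summary; the numbers below are copied from the table above:

```r
# t value = Estimate / Std. Error, checked for the attitude row
estimate  <- 3.3952
std_error <- 0.5741
estimate / std_error   # approx. 5.91, matching the reported t value
```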
According to the model, attitude has the strongest and most significant (p < 0.001) positive association with the points. Since only attitude has high significance, I will try making models with attitude and the surface approach, and with attitude and the strategic approach, separately. The summary of the latter is shown below.
my_model_2 <- lm(points ~ attitude + stra, data = data)
summary(my_model_2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
\(R^2\) gives the proportion of variability in the target variable accounted for by the explanatory variables: in the first case, approx. 21% of the variability in points is accounted for by the three explanatory variables attitude, strategic learning approach and surface learning approach. \(R\), the multiple correlation coefficient, is a measure of the fit of the model.
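\(R^2\) can be checked by hand as 1 - RSS/TSS. A self-contained sketch on R’s built-in cars data, so it runs without the course data:

```r
# R^2 = 1 - residual sum of squares / total sum of squares
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
c(by_hand = 1 - rss / tss, reported = summary(fit)$r.squared)
```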
In the model with only attitude and the strategic learning approach as explanatory variables (shown above), the strategic approach has a p-value of 0.09, still below 0.1, so slightly better. In this model, according to \(R^2\), approximately the same proportion of the variability, about 20%, is explained by the combination of attitude and the strategic approach as in the previous model with three explanatory variables. This supports the conclusion that the surface learning approach contributes very little.
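A more formal way to compare such nested models, not used above, is an F test with `anova()`; a toy sketch on built-in data:

```r
# Compare a reduced model against a fuller nested model with an F test
reduced <- lm(dist ~ speed, data = cars)
full    <- lm(dist ~ speed + I(speed^2), data = cars)
anova(reduced, full)   # tests whether the extra term improves the fit
```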
Let’s still see how the model looks with only one explanatory variable, attitude. The resulting summary is shown below: attitude alone accounts for 19% of the variability in points.
my_model_3 <- lm(points ~ attitude, data = data)
summary(my_model_3)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
It is important to note that relationships between the explanatory variables can also have a major effect on the model, and they need to be taken into account when deciding which variables to include and which to leave out! Here the surface and strategic learning approaches do have a small negative correlation according to the overview of the data (part 2), but it is much less significant than the correlation between attitude and points, so I do not consider it important.
I will continue the analysis with the model including attitude and strategic learning approach as the explanatory variables.
par(mfrow = c(2,2))
plot(my_model_2, which = c(1,2,5))
The graph above includes the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs. Leverage.
Assumptions of the model:
The QQ-plot allows exploring the normality assumption: the better the points fall on the line, the better the assumption holds. The fit here is rather good, especially in the middle, but there seem to be some outliers at both ends of the distribution that fall below the line.
Constant variance of the errors, and their independence of the values of the explanatory variables, can be assessed with the Residuals vs Fitted values plot. Any pattern in the plot would mean issues with this assumption. Here the plot shows no clear patterns, but the same outliers are marked in this one as well: observations 145, 56 and 35 seem odd. Based on these two plots, I would look more deeply into those answers and would consider removing those observations altogether if a clear reason for their behaviour is found.
The Residuals vs Leverage plot helps to identify observations that have an unusually high impact on the model. Here again a few observations stand out and are marked with numbers: 145 and 35 appear again. Since Cook’s distance is slightly higher around the same leverage values, these observations have a rather large influence on the model and should be looked into further.
Overall the diagnostic plots still look quite good, and the model can be considered valid. The same outliers are also present if the diagnostic plots are made for the model with only attitude as the explanatory variable. Removing the outliers could lead to greater significance and clearer model results.
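A numerical follow-up on the influential observations would be to rank Cook’s distances directly; sketched here on a built-in dataset so the block is self-contained:

```r
# Rank observations by influence using Cook's distance
fit <- lm(dist ~ speed, data = cars)
cd  <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)   # the three most influential rows
```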
date()
## [1] "Mon Nov 22 16:26:54 2021"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(boot)
alc <- read.csv("data/alc", row.names = 1)
dim(alc)
## [1] 370 51
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The data represent Portuguese secondary school students’ performance, including grades and demographic, social and school-related features. They were collected using school reports and questionnaires. Two subjects are combined here: Mathematics and Portuguese language. More information, including explanations of all the variables listed above, can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
‘Alc_use’ combines weekday (Dalc) and weekend (Walc) alcohol use. ‘High_use’ is defined as TRUE if ‘alc_use’ > 2, so a student’s alcohol consumption is considered high if the average of the two answers exceeds 2 on the 1-5 scale.
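The combination can be sketched with toy answers on the 1-5 scale (the values below are made up for illustration):

```r
# Toy Dalc/Walc answers; alc_use is their average, high_use the cutoff flag
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # weekday/weekend average
toy$high_use <- toy$alc_use > 2             # TRUE above the cutoff
toy$high_use   # FALSE TRUE
```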
Variables that I assume to have an important correlation with alcohol consumption, and my hypotheses about the connections:
In addition, I will include gender as a divider in the graphs.
A glimpse of the selected data again:
#select the wanted variables to check
sel_alc <- select(alc,one_of(c("G3","goout","absences","age")))
glimpse(sel_alc)
## Rows: 370
## Columns: 4
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14, 1…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 5…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1, …
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…
Bar plots of the chosen variables:
gather(sel_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
A small number of absences is common, and only very few students have more than 10 absences. Most students are between 15 and 18 years old, and only a few are older. The grades seem to follow a normal distribution somewhat shifted towards better scores: 10 and 12 are the most common grades. Going out with friends is also quite normally distributed; most students answered 3 or 2.
Making box plots of the variables with high use on the x-axis and gender defining the colour:
library(cowplot)
# initialize the plots for the 4 variables chosen
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + xlab("high use") + ggtitle("Student grade by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student absences by alcohol consumption and sex")
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out") + xlab("high use") + ggtitle("Going out by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student age by alcohol consumption and sex")
plot_grid(g1,g2,g3,g4)
Producing summary statistics for each selected variable based on high use of alcohol and gender:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
Grades: I hypothesized that grades would be negatively affected by alcohol use. According to the box plot and the mean grades shown above, this is true only for male students: males who do not use much alcohol have a higher average grade than females (whether high users or not), and males using alcohol often have a notably lower average grade than females. The box plot shows that some male students with high alcohol consumption have received very low grades, which pulls the average down quite a lot. For female students the variance is smaller and no outliers are present (see box plot).
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 4.25
## 2 F TRUE 41 6.85
## 3 M FALSE 105 2.91
## 4 M TRUE 70 6.1
Absences: I assumed that higher alcohol use would lead to more absences. For both genders this seems to hold true. Male students not using alcohol have a very low average number of absences, only 2.9, so the difference to the high alcohol users is very clear, approx. 3 more days of absence. For female students the difference is approx. 2.6 days. Female students have more absences in general, and several students have a much higher number of absences than the average, all the way up to more than 40.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_goout
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 2.95
## 2 F TRUE 41 3.39
## 3 M FALSE 105 2.70
## 4 M TRUE 70 3.93
Going out: The box plot for going out looks a bit odd due to the Likert scale. The median among the students using less alcohol is 3, and among those using alcohol more than twice a week it is 4 (both male and female). Here the hypothesis also holds. The difference between the mean values is larger for male students according to the summary statistics above. Students who go out more also consume more alcohol, which makes a lot of sense.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_age
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 16.6
## 2 F TRUE 41 16.5
## 3 M FALSE 105 16.3
## 4 M TRUE 70 16.9
Age: The mean age of those using alcohol many times a week is slightly lower for female students and higher for male students than that of the non-high users. This shows clearly in the medians in the box plot. Female students perhaps start drinking earlier than male students. It seems to hold true to some extent that younger students drink more; however, the students are generally quite young, as shown in the bar plots earlier.
Statistically exploring the data with logistic regression:
m <- glm(high_use ~ age + absences + G3 + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + absences + G3 + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8967 -0.7525 -0.5409 0.9467 2.2946
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.77834 1.96186 -1.926 0.05412 .
## age 0.04563 0.11022 0.414 0.67890
## absences 0.07315 0.02237 3.270 0.00108 **
## G3 -0.04472 0.03923 -1.140 0.25435
## goout 0.70668 0.11917 5.930 3.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.67 on 365 degrees of freedom
## AIC: 398.67
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age absences G3 goout
## -3.77834234 0.04562870 0.07314823 -0.04471930 0.70667982
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age 1.04668570 0.8430071337 1.299954
## absences 1.07589001 1.0312932049 1.127070
## G3 0.95626587 0.8852281993 1.032878
## goout 2.02724925 1.6139414455 2.577757
Odds ratios (OR) describe the odds of high alcohol use associated with each explanatory variable. If the value is over 1, the explanatory variable has a positive effect on the target variable (high use). In this case, age, absences and going out increase the chances of high alcohol use, while grades and alcohol use have a negative association. However, as the significance tests in the model summary show, only absences and going out are statistically significant in the model.
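The odds ratios above are simply the exponentiated model coefficients (which are on the log-odds scale); for example, the ‘goout’ estimate from the summary:

```r
# Convert a logistic-regression coefficient (log-odds) to an odds ratio
exp(0.70668)   # approx. 2.03: one extra point of going out roughly doubles the odds
```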
If the value 1 falls within the confidence interval (CI), there is no evidence of an association between the variables. In this case, only the confidence intervals of absences and going out do not include the value 1, which means that those two variables are associated with high alcohol use.
Regarding the earlier hypotheses, this agrees with the previous analysis based on the box plots. Absences have a clear positive correlation with high alcohol use, as does going out with friends. Based on the above, this model would look different regarding the grades if it were fitted for male students separately, whose grades were notably lower if they were high alcohol users.
For this exploration I will include the variables ‘absences’ and ‘goout’, based on the analysis above. Here I will use the model to predict the actual values, and save and compare the predictions to the actual values. If the probability is larger than 0.5, high use is predicted as TRUE.
m2 <- glm(high_use ~ absences + goout, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, goout, high_use, probability, prediction) %>% tail(10)
## absences goout high_use probability prediction
## 361 7 3 TRUE 0.28963176 FALSE
## 362 3 3 TRUE 0.23104046 FALSE
## 363 2 1 TRUE 0.06029226 FALSE
## 364 4 4 TRUE 0.40315734 FALSE
## 365 3 2 FALSE 0.12606082 FALSE
## 366 4 3 TRUE 0.24487648 FALSE
## 367 0 2 FALSE 0.10291931 FALSE
## 368 4 4 TRUE 0.40315734 FALSE
## 369 8 4 TRUE 0.47825016 FALSE
## 370 0 2 FALSE 0.10291931 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 236 23
## TRUE 65 46
The table of the last ten values shows that the model has predicted none of these cases as high alcohol consumption based on absences and going out, even though 7/10 of the actual observations are high users. This model is clearly not perfect. Still, the table of the target variable vs. the predictions shows that most cases are predicted correctly by the model.
Then I plot the predictions against the actual values (graphical visualization). If the model were perfect at predicting the actual values, the upper line would have dots only on the right side (probability > 0.5) and the lower line only on the left side.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63783784 0.06216216 0.70000000
## TRUE 0.17567568 0.12432432 0.30000000
## Sum 0.81351351 0.18648649 1.00000000
According to the table, approximately 24% of the predictions went wrong (the training error: 0.0622 + 0.1757 ≈ 0.24).
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513514
The average proportion of wrong predictions in the cross-validation is around 0.24-0.26 (it changes with each run). Therefore the model’s prediction performance is very close to that of the model at DataCamp.
END OF WEEK 3!
date()
## [1] "Mon Nov 22 16:26:56 2021"
The data represent Housing Values in Suburbs of Boston from the 1970s. Loading the data and taking a look at it:
library(dplyr)
library(ggplot2)
library(tidyr)
library(boot)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.92 loaded
# load the data
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506 14
There are 506 observations and the following 14 variables:
Looking at the summary of the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Making a correlation matrix and visualizing it:
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Based on the visual overview, the strongest negative correlations are between:
According to this, the nitrogen oxides concentration decreases with distance from the employment centres, so it sounds like the employment centres produce nitrogen oxide emissions, which makes sense if they are based on heavy industry. The second is also clear: areas with older buildings are built close to the workplaces, and newer buildings are further away since the closer areas are already occupied. A larger share of lower-status population unsurprisingly correlates with a lower value of owner-occupied homes.
And the strongest positive correlation is between:
These follow from the earlier observations: access to radial highways indicates a smaller distance to the employment centres, and (non-retail) industrial businesses emit nitrogen oxides.
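The strongest pairs can also be extracted programmatically instead of read off the plot; a sketch using the same Boston data:

```r
library(MASS)                          # for the Boston data
cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once
ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
top <- arrayInd(ord[1:3], dim(cm))     # row/column indices of the top 3
top3 <- data.frame(var1 = rownames(cm)[top[, 1]],
                   var2 = colnames(cm)[top[, 2]],
                   corr = cm[ord[1:3]])
top3   # rad-tax, nox-dis and indus-nox head the list
```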
#centre and standardize variables
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
The mean of each variable is now zero, since standardizing subtracts the column mean from every value and divides by the column standard deviation.
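To be precise, scale() subtracts each column’s mean and divides by its standard deviation; a minimal check on a single vector:

```r
# scale() centres to mean 0 and scales to sd 1
x <- c(2, 4, 6)
as.vector(scale(x))     # same as (x - mean(x)) / sd(x)
(x - mean(x)) / sd(x)   # -1 0 1
```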
Next I will create a factor variable of the crime rate and remove the old one, and split the data into train and test sets for testing the predictions.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# removing the old crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
# adding the new categorized crime rate variable to the standardized dataset
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows
n <- nrow(boston_scaled)
# choosing randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# creating the train set
train <- boston_scaled[ind,]
# ... and the test set
test <- boston_scaled[-ind,]
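Since `sample()` is random, the split above changes between runs; setting a seed makes it reproducible. A minimal sketch with a toy data frame (the seed value is an arbitrary choice):

```r
set.seed(2021)  # arbitrary seed, only for reproducibility
toy <- data.frame(id = 1:10, value = rnorm(10))
# choose 80% of the rows at random, as above
idx <- sample(nrow(toy), size = nrow(toy) * 0.8)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]
nrow(toy_train)  # 8
nrow(toy_test)   # 2
```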
Then I will fit the linear discriminant analysis to the train data with crime rate as the target variable and all the other variables as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2524752 0.2351485 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 0.88711774 -0.9227613 -0.157656245 -0.8765262 0.50233351 -0.8455100
## med_low -0.09777286 -0.2993349 -0.002135914 -0.5614436 -0.07205145 -0.3698962
## med_high -0.37711365 0.2397391 0.142102536 0.4438409 0.16888337 0.4160529
## high -0.48724019 1.0149946 -0.045188669 1.0221915 -0.39952243 0.8083092
## dis rad tax ptratio black lstat
## low 0.7884350 -0.7008870 -0.7738451 -0.4718608 0.37648358 -0.76919504
## med_low 0.3117408 -0.5573888 -0.5041183 -0.1184852 0.31977281 -0.17780689
## med_high -0.4358222 -0.4112644 -0.2847119 -0.3971204 0.08086429 -0.07361578
## high -0.8549995 1.6596029 1.5294129 0.8057784 -0.79176554 0.86045092
## medv
## low 0.59413061
## med_low 0.05442224
## med_high 0.21675404
## high -0.66891024
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12123030 0.610895304 -0.97397635
## indus 0.04565686 -0.360594257 0.30828851
## chas -0.08925399 -0.041333545 0.22793753
## nox 0.28083238 -0.796605708 -1.25870966
## rm -0.12017943 -0.117179919 -0.17463934
## age 0.23895204 -0.232733101 -0.23843974
## dis -0.10529709 -0.116187535 0.10670844
## rad 3.71309876 0.882793731 -0.05707155
## tax 0.08214568 -0.021721523 0.38126064
## ptratio 0.10071342 0.184665108 -0.25254036
## black -0.14799936 0.001525138 0.13938999
## lstat 0.15050703 -0.063422576 0.52012330
## medv 0.16359951 -0.261202949 -0.11244441
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9614 0.0290 0.0096
The following represents the LDA biplot. The proportion of trace above shows that the first discriminant LD1 alone captures about 96% of the between-group variance:
# making arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the results with arrows
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The correct classes from the test data are saved and then removed here:
# saving the correct classes from test data
correct_classes <- test$crime
# lastly remove the crime variable from test data
test <- dplyr::select(test, -crime)
Then I will use the LDA model to predict the classes, and cross-tabulate the results with the correct categories from the test set.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 6 0 0
## med_low 2 16 6 0
## med_high 1 17 11 2
## high 0 0 1 22
The category ‘high’ is predicted very well by the model. The category ‘med_high’, on the other hand, is most often misclassified as ‘med_low’ (17 of 31 observations in this run). In general the model seems reasonably good, but because the train and test sets are formed randomly, the fitted model changes somewhat with every run and therefore the results vary as well. Averaging over many runs would be needed for a more precise estimate of the model's performance.
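The averaging idea could be sketched roughly like this (a rough sketch: the data preparation repeats the earlier steps so the chunk is self-contained, and the number of repetitions, 20, is an arbitrary choice):

```r
library(MASS)  # Boston data and lda()

# rebuild the standardized data with the categorical crime variable
bs <- as.data.frame(scale(Boston))
bins <- quantile(bs$crim)
bs$crime <- cut(bs$crim, breaks = bins, include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
bs$crim <- NULL

n <- nrow(bs)
# repeat the random split, fit and prediction, collecting the test accuracy
accuracies <- replicate(20, {
  ind <- sample(n, size = n * 0.8)
  fit <- lda(crime ~ ., data = bs[ind, ])
  pred <- predict(fit, newdata = bs[-ind, ])
  mean(pred$class == bs$crime[-ind])
})
mean(accuracies)  # average test-set accuracy over 20 random splits
```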
I will load the Boston data again and standardize it, and then calculate the Euclidean distances between the observations.
data('Boston')
boston_scaled2 <- scale(Boston)
# calculating the Euclidean distance matrix (note: computed here on the original, unscaled data)
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
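Scaling matters for distance-based methods: variables with large ranges would otherwise dominate the Euclidean distance. A toy illustration with two made-up variables, `small` and `large`:

```r
# two observations differing by 1 in a small-scale variable and by 100 in a large-scale one
toy <- data.frame(small = c(0, 1), large = c(0, 100))
dist(toy)         # ~100: almost entirely driven by the large-scale variable
dist(scale(toy))  # after scaling, both variables contribute equally
```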
Next the K-means algorithm with 2 clusters is run and the results visualized (only columns 6-10 are plotted to see the results better):
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Then 3 clusters are used instead of 2; this looks good and clear groups can be seen:
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
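The choice between 2 and 3 clusters could also be guided by the total within-cluster sum of squares (WCSS), a common heuristic where one looks for an "elbow" in the curve. A sketch on the scaled data (the seed, the range of k, and `nstart = 10` are arbitrary choices, not part of the assignment output above):

```r
library(MASS)  # Boston data
set.seed(123)  # k-means initialization is random
bs <- scale(Boston)
# total within-cluster sum of squares for k = 1..10 clusters
twcss <- sapply(1:10, function(k) kmeans(bs, centers = k, nstart = 10)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```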
END OF WEEK 4!
(more chapters to be added similarly as we proceed with the course!)